The following content has been provided by the University of Erlangen-Nürnberg.
So I'm also very grateful to come here. It's the first time in Erlangen.
It's also because our activities diverged a bit. We were working on the same topics around 1992-1996, and then we had different activities.
And in the last three or four years, I went back to polyhedral optimizations, which are also useful here.
So this is why we meet again, and I hope we can work together in the near future.
So today the talk is about optimizing remote accesses for kernels that you offload on some platform,
but basically for high-level synthesis for FPGA. Maybe it could be useful in other contexts.
So just a few words about the team. We have a small group of four people with two external collaborators.
That means I come sometimes, but not always. So Christophe Alias, myself, Paul Feautrier, whom you have heard about, I guess, and Fabrice Rastello.
So we have two main activities. The first is back-end code optimizations: we worked a lot on aggressive optimizations at the assembly-code level,
and in particular on register allocation and static single assignment form.
So we have a lot of results on this area. Maybe you have heard of that, but it's not the topic of today.
We also work a lot on high-level analysis and transformations, trying to exploit nested loops.
So working on software pipelining, multidimensional software pipelining, nested-loop scheduling, tiling,
and the optimization of buffers in shared memory, going beyond FIFOs.
And we also develop our own polyhedral tools, as well as some graph-based algorithms for all our activity on back-end code optimizations.
And now we are trying to move slowly towards streaming languages. So we'll see what we do in the next four years.
So the talk today is, so I'm already on, I skipped that.
Oh yes, you wanted me to tell a few words about me. So I did a lot on automatic parallelization.
I worked a lot on systolic array design, then automatic parallelization, extraction of parallelism from loops, basically,
then a bit of software pipelining, then, as I said, on backend code optimizations, static single assignment, register allocation.
And now again, on high-level transformation for high-level synthesis.
And also a book on parallelization.
Yes, that's right. So in this talk, I first want to explain the motivation for doing high-level transformations for high-level synthesis.
Then I will give an overview of the compilation scheme that we have for optimizing the communications and trying to automate double buffering.
This has something to do with something which is known as communication coalescing.
So I will just remind you a few things.
And then I will go into the details, at the end of the talk, so more technical: how we can automate this to get automatic data reuse between tiles, and inside tiles,
when you offload a tiled kernel onto an FPGA.
And I will give a few examples.
OK, so we are considering high-level synthesis tools from a high-level perspective, even higher than high-level synthesis.
Because we think that if you look at the industrial tools and academic tools that exist, like SPARK, GAUT, UGH, a list I put here, like PICO Express, Sparrow, Catapult C,
now they are quite good at optimizing the heart of the computation.
That means they can optimize finite state machine, exploit instruction-level parallelism, they can do resource sharing, resource allocation, all the traditional optimization in compilers.
It's not as true for parallelism, but at least a sequential part on its own is really well optimized.
But strangely, the designers prefer to ignore these tools and are still coding in VHDL.
So one of the reasons is that in some of these tools, the interface with the outside world is not really well managed.
Sometimes in some tools, you can just have no way of interfacing with the outside world because there is no semantics on the input and output.
Sometimes you have to do it yourself, so you have to write VHDL glue.
Sometimes there is no communication optimization, so you have to redesign the algorithm.
And to do this, you need a lot of knowledge.
So what we want to do is to find automatic ways of feeding such accelerators with data.
And one of our assumptions is that we are going to focus on accelerators that are limited by the bandwidth.
So we assume that, for the sequential part, with some ILP inside, we put all the hardware necessary to perform one iteration of a loop per cycle.
So everything is going to be bounded by the memory bandwidth.
So you have to keep this in mind.
And what we want to do is to push all the dirty work into the back-end compiler.
That means we don't want to write additional code in VHDL.
We may write additional code, but this additional code has to be compiled by the same high-level synthesis tool itself.
So it's an exercise.
Presenters
PD Dr. Alain Darte
Accessible via
Open access
Duration
00:50:48 min
Recording date
2012-05-18
Uploaded on
2012-05-18 18:07:05
Language
de-DE
Some data- and compute-intensive applications can be accelerated by offloading portions of codes to platforms such as GPGPUs or FPGAs. However, to get high performance for these kernels, it is mandatory to restructure the application, to generate adequate communication mechanisms for the transfer of remote data, and to make good usage of the memory bandwidth. In the context of the high-level synthesis (HLS), from a C program, of hardware accelerators on FPGA, we show how to automatically generate optimized remote accesses for an accelerator communicating to an external DDR memory. Loop tiling is used to enable block communications, suitable for DDR memories. Pipelined communication processes are generated to overlap communications and computations, thereby hiding some latencies, in a way similar to double buffering. Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory.